AITopics | Maricopa

Coccidioidomycosis, commonly known as Valley Fever, remains a significant public health concern in endemic regions of the southwestern United States. This study develops the first graph neural network (GNN) model for forecasting Valley Fever incidence in Arizona. The model integrates surveillance case data with environmental predictors using graph structures, including soil conditions, atmospheric variables, agricultural indicators, and air quality metrics. Our approach explores correlation-based relationships among variables influencing disease transmission. The model captures critical delays in disease progression through lagged effects, enhancing its capacity to reflect complex temporal dependencies in disease ecology. Results demonstrate that the GNN architecture effectively models Valley Fever trends and provides insights into key environmental drivers of disease incidence. These findings can inform early warning systems and guide resource allocation for disease prevention efforts in high-risk areas.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2507.10014

Country:

North America > United States > Arizona > Maricopa County (0.05)
North America > United States > California (0.04)
North America > United States > Wyoming (0.04)
(8 more...)

Genre: Research Report > New Finding (0.88)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Public Health (1.00)
Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DataMan: Data Manager for Pre-training Large Language Models

Peng, Ru, Yang, Kexin, Zeng, Yawen, Lin, Junyang, Liu, Dayiheng, Zhao, Junbo

arXiv.org Artificial IntelligenceMar-13-2025

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by ``reverse thinking'' -- prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

conference paper, consistency, criteria, (15 more...)

arXiv.org Artificial Intelligence

2502.19363

Country:

South America > Venezuela (0.14)
Europe > Norway (0.13)
North America > United States > Alabama (0.04)
(49 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Sports (1.00)
Law (1.00)
(8 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)

Add feedback

SatFlow: Generative model based framework for producing High Resolution Gap Free Remote Sensing Imagery

Irigireddy, Bharath, Bandaru, Varaprasad

arXiv.org Artificial IntelligenceFeb-3-2025

Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55\% of the optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach in reliably imputing cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking, environmental change detection etc.,

imagery, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2502.01098

Country: North America > United States > Arizona > Pinal County > Maricopa (0.04)

Genre: Research Report (0.84)

Industry:

Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.95)
Government (0.69)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Add feedback

Dynamic Benchmarks: Spatial and Temporal Alignment for ADS Performance Evaluation

Chen, Yin-Hsiu, Scanlon, John M., Kusano, Kristofer D., McMurry, Timothy L., Victor, Trent

arXiv.org Artificial IntelligenceOct-11-2024

Deployed SAE level 4+ Automated Driving Systems (ADS) without a human driver are currently operational ride-hailing fleets on surface streets in the United States. This current use case and future applications of this technology will determine where and when the fleets operate, potentially resulting in a divergence from the distribution of driving of some human benchmark population within a given locality. Existing benchmarks for evaluating ADS performance have only done county-level geographical matching of the ADS and benchmark driving exposure in crash rates. This study presents a novel methodology for constructing dynamic human benchmarks that adjust for spatial and temporal variations in driving distribution between an ADS and the overall human driven fleet. Dynamic benchmarks were generated using human police-reported crash data, human vehicle miles traveled (VMT) data, and over 20 million miles of Waymo's rider-only (RO) operational data accumulated across three US counties. The spatial adjustment revealed significant differences across various severity levels in adjusted crash rates compared to unadjusted benchmarks with these differences ranging from 10% to 47% higher in San Francisco, 12% to 20% higher in Maricopa, and 7% lower to 34% higher in Los Angeles counties. The time-of-day adjustment in San Francisco, limited to this region due to data availability, resulted in adjusted crash rates 2% lower to 16% higher than unadjusted rates, depending on severity level. The findings underscore the importance of adjusting for spatial and temporal confounders in benchmarking analysis, which ultimately contributes to a more equitable benchmark for ADS performance evaluations.

benchmark, crash rate, san francisco, (16 more...)

arXiv.org Artificial Intelligence

2410.08903

Country:

North America > United States > California > San Francisco County > San Francisco (0.59)
North America > United States > California > Los Angeles County > Los Angeles (0.29)
North America > United States > Arizona > Maricopa County > Phoenix (0.14)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.89)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Add feedback

RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

Ge, Junyao, Zheng, Yang, Guo, Kaitai, Liang, Jimin

arXiv.org Artificial IntelligenceAug-26-2024

Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.

caption, dataset, image patch, (16 more...)

arXiv.org Artificial Intelligence

2408.14744

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Oregon (0.04)
North America > United States > Mississippi (0.04)
(6 more...)

Genre: Research Report > New Finding (0.66)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Benchmarks for Retrospective Automated Driving System Crash Rate Analysis Using Police-Reported Crash Data

Scanlon, John M., Kusano, Kristofer D., Fraade-Blanar, Laura A., McMurry, Timothy L., Chen, Yin-Hsiu, Victor, Trent

arXiv.org Artificial IntelligenceDec-20-2023

With fully automated driving systems (ADS; SAE level 4) ride-hailing services expanding in the US, we are now approaching an inflection point, where the process of retrospectively evaluating ADS safety impact can start to yield statistically credible conclusions. An ADS safety impact measurement requires a comparison to a "benchmark" crash rate. This study aims to address, update, and extend the existing literature by leveraging police-reported crashes to generate human crash rates for multiple geographic areas with current ADS deployments. All of the data leveraged is publicly accessible, and the benchmark determination methodology is intended to be repeatable and transparent. Generating a benchmark that is comparable to ADS crash data is associated with certain challenges, including data selection, handling underreporting and reporting thresholds, identifying the population of drivers and vehicles to compare against, choosing an appropriate severity level to assess, and matching crash and mileage exposure data. Consequently, we identify essential steps when generating benchmarks, and present our analyses amongst a backdrop of existing ADS benchmark literature. One analysis presented is the usage of established underreporting correction methodology to publicly available human driver police-reported data to improve comparability to publicly available ADS crash data. We also identify important dependencies in controlling for geographic region, road type, and vehicle type, and show how failing to control for these features can bias results. This body of work aims to contribute to the ability of the community - researchers, regulators, industry, and experts - to reach consensus on how to estimate accurate benchmarks.

benchmark, crash rate, vehicle, (15 more...)

arXiv.org Artificial Intelligence

2312.13228

Country:

North America > United States > California > San Francisco County > San Francisco (0.15)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Michigan (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Automobiles & Trucks (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)

Add feedback

Clustering US Counties to Find Patterns Related to the COVID-19 Pandemic

Brown, Cora, Milstein, Sarah, Sun, Tianyi, Zhao, Cooper

arXiv.org Artificial IntelligenceMar-19-2023

When COVID-19 first started spreading and quarantine was implemented, the Society for Industrial and Applied Mathematics (SIAM) Student Chapter at the University of Minnesota-Twin Cities began a collaboration with Ecolab to use our skills as data scientists and mathematicians to extract useful insights from relevant data relating to the pandemic. This collaboration consisted of multiple groups working on different projects. In this write-up we focus on using clustering techniques to help us find groups of similar counties in the US and use that to help us understand the pandemic. Our team for this project consisted of University of Minnesota students Cora Brown, Sarah Milstein, Tianyi Sun, and Cooper Zhao, with help from Ecolab Data Scientist Jimmy Broomfield and University of Minnesota student Skye Ke. In the sections below we describe all of the work done for this project. In Section 2, we list the data we gathered, as well as the feature engineering we performed. In Section 3, we describe the metrics we used for evaluating our models. In Section 4, we explain the methods we used for interpreting the results of our various clustering approaches. In Section 5, we describe the different clustering methods we implemented. In Section 6, we present the results of our clustering techniques and provide relevant interpretation. Finally, in Section 7, we provide some concluding remarks comparing the different clustering methods.

artificial intelligence, cluster 0, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2303.11936

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Michigan > Wayne County > Wayne (0.04)
North America > United States > Texas > Dallas County > Dallas (0.04)
(26 more...)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Epidemiology (0.86)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.63)
Health & Medicine > Therapeutic Area > Immunology (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

High-resolution synthetic residential energy use profiles for the United States

Thorve, Swapna, Baek, Young Yun, Swarup, Samarth, Mortveit, Henning, Marathe, Achla, Vullikanti, Anil, Marathe, Madhav

arXiv.org Artificial IntelligenceDec-15-2022

Efficient energy consumption is crucial for achieving sustainable energy goals in the era of climate change and grid modernization. Thus, it is vital to understand how energy is consumed at finer resolutions such as household in order to plan demand-response events or analyze impacts of weather, electricity prices, electric vehicles, solar, and occupancy schedules on energy consumption. However, availability and access to detailed energy-use data, which would enable detailed studies, has been rare. In this paper, we release a unique, large-scale, digital-twin of residential energy-use dataset for the residential sector across the contiguous United States covering millions of households. The data comprise of hourly energy use profiles for synthetic households, disaggregated into Thermostatically Controlled Loads (TCL) and appliance use. The underlying framework is constructed using a bottom-up approach. Diverse open-source surveys and first principles models are used for end-use modeling. Extensive validation of the synthetic dataset has been conducted through comparisons with reported energy-use data. We present a detailed, open, high resolution, residential energy-use dataset for the United States.

data mining, household, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2210.08103

Country:

North America > United States > District of Columbia > Washington (0.14)
North America > United States > Texas > Travis County > Austin (0.14)
North America > United States > Montana (0.14)
(27 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Energy > Power Industry (1.00)
Transportation > Ground > Road (0.68)
Energy > Renewable > Solar (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Software (0.87)

Add feedback